We now know enough to put together a data analysis project from beginning to end. In the next few chapters, I will lay out the flow of a project and the steps we follow to build one. The structure is the one presented by Garrett Grolemund and Hadley Wickham in R for Data Science¹, their introductory book on R and the Tidyverse that I mentioned in Chapter 2.

Import Data

The first step in any analysis using R is to import the data from whatever source produces it. Most biological data, whether it comes from machines or from web sites, arrives as some variant of the spreadsheet files that Excel produces, either “.csv” or “.xlsx” files. The major exception is biological sequence data, which is usually stored in FASTA format (“.fasta”). FASTA and “.csv” files are simple text files that any text editor can read. However, R has specific functions for each format, with additional capabilities that ease the import process. We will consider FASTA files and other bioinformatics-specific functions in a later chapter.
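
As a minimal sketch of the import step, assuming the readr package is installed, the following code writes a small made-up “.csv” file to a temporary location and reads it back with read_csv(). The file contents and column names are invented for illustration only. For “.xlsx” files, read_excel() from the readxl package plays the same role.

```r
library(readr)

# Hypothetical example data, written to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sample_id,species,weight_g",
  "S1,mouse,21.5",
  "S2,mouse,",
  "S3,rat,250.1"
), csv_path)

# read_csv() guesses a type for each column and treats empty fields as missing (NA)
samples <- read_csv(csv_path, show_col_types = FALSE)

print(samples)
```

Note that the empty weight for sample S2 is imported as NA rather than as zero or an empty string, which matters when we start looking for missing data.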

Tidy the Data

Tidying the data means making sure the data is in a format that is useful. Many data sets that you import will have been collected and stored haphazardly, without regard to how you as a researcher are going to use them. Most, if not all, data sets need some reorganization to make them useful. The work in this phase of a project is to see what you have and fix the problems that you can identify. These problems may include records with missing values or values that are impossible. For example, I recently worked with a data set that reported someone’s age as 135. You need to catch all these errors, whether they are simple typographic errors or more serious cases of data that doesn’t make sense.
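
A first look for these kinds of problems can be sketched with base R alone. The data frame below is invented for illustration, and the plausible age range of 0 to 120 is my assumption, not a rule; choose limits that make sense for your own data.

```r
# Hypothetical data set with the kinds of problems described above
patients <- data.frame(
  id  = c(1, 2, 3, 4, 5),
  age = c(34, 135, NA, 58, -2),
  sex = c("F", "M", "M", NA, "F")
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(patients))
print(missing_per_column)

# Flag ages outside a plausible range (0 to 120 is an assumed cutoff)
implausible_age <- !is.na(patients$age) &
  (patients$age < 0 | patients$age > 120)

# Show the rows that need a closer look
print(patients[implausible_age, ])
```

Flagging the suspicious rows rather than deleting them right away leaves you free to decide, case by case, whether a value is a typing error that can be corrected or a record that has to be dropped.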

“Tidy Data”

There is a second meaning of “tidy” in use today among R users. Here, tidy refers to a set of principles for organizing data that suits many analyses. Many functions in R, especially those of the Tidyverse², assume that data will be presented in a tidy format. The hallmark of tidy data is the consistency of its format.

Tidyverse

Hadley Wickham, the originator and moving force behind the Tidyverse, has defined “tidy data” much better than I can.

Like families, tidy datasets are all alike, but every messy dataset is messy in its own way…. A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes…. Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.³

Tidy data obey three rules:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.⁴
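
To make the three rules concrete, here is a small sketch, assuming the tidyr package is installed. The plant growth table is invented for illustration. In its “wide” form the variable day is hidden in the column names and each row holds three observations; pivot_longer() reshapes it so that each variable has its own column and each observation its own row.

```r
library(tidyr)

# Hypothetical "wide" table: one row per plant, one column per measurement day
growth_wide <- data.frame(
  plant  = c("A", "B"),
  day_1  = c(2.1, 1.8),
  day_7  = c(4.5, 3.9),
  day_14 = c(8.0, 7.2)
)

# Gather the day_* columns into two columns: the day and the measured height
growth_tidy <- pivot_longer(
  growth_wide,
  cols = starts_with("day_"),
  names_to = "day",
  names_prefix = "day_",
  values_to = "height_cm"
)

# The day arrives as text taken from the column names; store it as a number
growth_tidy$day <- as.integer(growth_tidy$day)

print(growth_tidy)
```

The tidy version has six rows, one for each plant on each day, and three columns: plant, day and height_cm.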

How Do We Tidy Data?

In order to put data in tidy form, both in terms of the three rules and in terms of cleaning up obvious difficulties, we will use a number of functions from various packages to examine the data and to make some basic graphs that show us what the data say and where problems may lie. We will then use functions from the Tidyverse’s data manipulation package, dplyr, and others to put the data into shape.
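
The following is a minimal sketch of what putting data into shape with dplyr can look like, assuming the dplyr package is installed. The data frame, the age cutoff of 120 and the decision to drop records without a recorded sex are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical data set with one impossible age and some missing values
patients <- data.frame(
  id  = c(1, 2, 3, 4, 5),
  age = c(34, 135, NA, 58, 41),
  sex = c("F", "M", "M", NA, "F")
)

patients_clean <- patients %>%
  # Replace impossible ages with NA instead of guessing the intended value
  mutate(age = if_else(age > 120, NA_real_, age)) %>%
  # Drop records where sex was not recorded (an assumed rule for this sketch)
  filter(!is.na(sex)) %>%
  # Keep the rows in a predictable order
  arrange(id)

print(patients_clean)
```

Each step reads as one verb applied to the data, which makes it easy to add, remove or reorder cleaning steps as you learn more about the problems in a data set.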

Explore the Data

This is the phase of data analysis that you originally thought was all of data analysis. This is where you build models, evaluate them, test hypotheses, make graphs you want to present to others and answer the questions that motivated your project in the first place. Given the nature of data analysis, you will be going through the stages of this phase repeatedly until you are satisfied with the answers you have achieved.
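
As a small sketch of what this phase can involve, the code below fits a simple linear model to ToothGrowth, a data set that ships with R and records tooth length in guinea pigs given different doses of vitamin C. The choice of data set and of a straight-line model is mine, purely for illustration.

```r
# ToothGrowth is built into R: len = tooth length, dose = vitamin C dose
fit <- lm(len ~ dose, data = ToothGrowth)

# The summary reports the estimated coefficients and a test for each one
fit_summary <- summary(fit)
print(fit_summary)

# Pull out the slope and its p-value for the hypothesis that the slope is zero
dose_slope   <- coef(fit)[["dose"]]
dose_p_value <- coef(fit_summary)["dose", "Pr(>|t|)"]
```

In a real project you would go on to check the model, for example by plotting its residuals, and then revise it, which is the repeated cycle described above.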

While this phase of data analysis may seem the most important (it is certainly the most satisfying), it is not where you will spend most of your time. Tidying data takes much longer than exploring it. A common expectation is that you will spend 60 to 80 percent of your hours on a project tidying the data and only 20 to 40 percent actually doing what you set out to do. So don’t be surprised or worried if you need more time to organize your data than you initially planned. That is normal.

Communicate Your Results

When you are ready, R Markdown will be your friend in presenting your results to your audiences in the form of papers, reports and slide presentations. R Markdown supports a wide variety of output formats, which cover most of the ways you will need to communicate your results. R Markdown slides and documents can include R-generated tables and graphics.
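
To give a feel for what an R Markdown document looks like, the sketch below writes a very small one to a temporary file. The title and contents are invented for illustration. Turning it into a finished report requires the rmarkdown package and pandoc, so the render() call is left commented out.

```r
# Build the code-chunk delimiter (three backticks) without typing it literally
fence <- strrep("`", 3)

# A minimal R Markdown document: a header, a heading, some text and one R chunk
rmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "## Tooth growth data",
  "",
  "The table below summarizes the ToothGrowth data set.",
  "",
  paste0(fence, "{r}"),
  "summary(ToothGrowth)",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# With rmarkdown and pandoc installed, this would produce an HTML report:
# rmarkdown::render(rmd_path)
```

Changing the output line to another format, such as word_document, is how the same source file can feed different kinds of output.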

We will now explore each of these phases in more detail as we work through some example projects.


  1. Wickham, Hadley, and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly, 2016. Also available at: https://r4ds.had.co.nz/

  2. The tidyverse is a set of packages that is largely the result of the work of Hadley Wickham of RStudio. He refers to it as an “opinionated” collection of packages. They do many things differently than the base R system. While the tidyverse packages offer many advantages, not all R specialists are fans. However, they make the work of an R beginner easier to understand. We will use most of the packages in this course.

  3. Wickham, Hadley. “Tidy Data”. Journal of Statistical Software 59, no. 10 (2014). https://doi.org/10.18637/jss.v059.i10.

  4. Wickham and Grolemund, Ch. 12.1.